Textual Reasoning
The Unreliability of Explanations in Few-shot Prompting for Textual Reasoning
Does prompting a large language model (LLM) like GPT-3 with explanations improve in-context learning? We study this question on two NLP tasks that involve reasoning over text, namely question answering and natural language inference. We test the performance of four LLMs on three textual reasoning datasets using prompts that include explanations in multiple different styles. For these tasks, we find that including explanations in the prompts for OPT, GPT-3 (davinci), and InstructGPT (text-davinci-001) only yields small to moderate accuracy improvements over standard few-shot learning. However, text-davinci-002 is able to benefit more substantially. We further show that explanations generated by the LLMs may not entail the models' predictions nor be factually grounded in the input, even on simple tasks with extractive explanations. However, these flawed explanations can still be useful as a way to verify LLMs' predictions post-hoc. Through analysis in our three settings, we show that explanations judged by humans to be good--logically consistent with the input and the prediction--are more likely to co-occur with accurate predictions. Following these observations, we train calibrators using automatically extracted scores that assess the reliability of explanations, allowing us to improve performance post-hoc across all of our datasets.
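The calibration idea lends itself to a compact illustration. Below is a minimal sketch, not the paper's exact pipeline: a logistic-regression calibrator is trained on cheap, automatically extracted explanation-reliability signals and used to decide post-hoc whether to trust a prediction. The overlap feature and the toy data are illustrative assumptions.

```python
# Minimal sketch (not the paper's exact pipeline): score the reliability of
# an LLM explanation and use that score to accept or reject the prediction.
from sklearn.linear_model import LogisticRegression

def explanation_features(input_text: str, explanation: str) -> list[float]:
    """Cheap, automatically extracted reliability signals (assumed features)."""
    inp, expl = set(input_text.lower().split()), set(explanation.lower().split())
    overlap = len(inp & expl) / max(len(expl), 1)   # factual-grounding proxy
    return [overlap, len(expl) / 50.0]              # length as a weak signal

# Toy training set: (input, explanation, was_prediction_correct)
train = [
    ("the cat sat on the mat", "the cat is on the mat", 1),
    ("the dog barked loudly", "penguins live in antarctica", 0),
    ("she bought three apples", "three apples were bought", 1),
    ("he closed the window", "the stock market rose today", 0),
]
X = [explanation_features(i, e) for i, e, _ in train]
y = [label for _, _, label in train]

calibrator = LogisticRegression().fit(X, y)

# At test time, accept the LLM's prediction only when the calibrator judges
# its explanation reliable.
p_reliable = calibrator.predict_proba(
    [explanation_features("the bird flew away", "the bird left")])[0, 1]
print("trust prediction" if p_reliable > 0.5 else "fall back / abstain")
```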
Are Large Vision Language Models Truly Grounded in Medical Images? Evidence from Italian Clinical Visual Question Answering
Felizzi, Federico, Riccomi, Olivia, Ferramola, Michele, Causio, Francesco Andrea, Del Medico, Manuel, De Vita, Vittorio, De Mori, Lorenzo, Piscitelli, Alessandra, Risuleo, Pietro Eric, Castaniti, Bianca Destro, Cristiano, Antonio, Longo, Alessia, De Angelis, Luigi, Vassalli, Mariapia, Di Pumpo, Marcello
Large vision language models (VLMs) have achieved impressive performance on medical visual question answering benchmarks, yet their reliance on visual information remains unclear. We investigate whether frontier VLMs demonstrate genuine visual grounding when answering Italian medical questions by testing four state-of-the-art models: Claude Sonnet 4.5, GPT-4o, GPT-5-mini, and Gemini 2.0 flash exp. Using 60 questions from the EuropeMedQA Italian dataset that explicitly require image interpretation, we substitute correct medical images with blank placeholders to test whether models truly integrate visual and textual information. Our results reveal striking variability in visual dependency: GPT-4o shows the strongest visual grounding with a 27.9pp accuracy drop (83.2% [74.6%, 91.7%] to 55.3% [44.1%, 66.6%]), while GPT-5-mini, Gemini, and Claude maintain high accuracy with modest drops of 8.5pp, 2.4pp, and 5.6pp respectively. Analysis of model-generated reasoning reveals confident explanations for fabricated visual interpretations across all models, suggesting varying degrees of reliance on textual shortcuts versus genuine visual analysis. These findings highlight critical differences in model robustness and the need for rigorous evaluation before clinical deployment.
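As a concrete illustration of the substitution protocol, here is a hedged sketch: accuracy is measured once with the real image and once with a blank placeholder, and the gap is read as visual dependency. The `ask` callable stands in for whatever VLM API is being evaluated and is an assumption, not a real client.

```python
# Illustrative sketch of the blank-image substitution protocol.
from PIL import Image

def accuracy(ask, items, blank=False):
    """items: list of (image_path, question, gold_answer) triples."""
    correct = 0
    for image_path, question, gold in items:
        image = (Image.new("RGB", (512, 512), "white") if blank
                 else Image.open(image_path))
        answer = ask(image, question)  # `ask` wraps the VLM being evaluated
        correct += int(answer.strip().lower() == gold.strip().lower())
    return correct / len(items)

# Visual dependency in percentage points, as reported above:
# drop_pp = 100 * (accuracy(ask, data) - accuracy(ask, data, blank=True))
```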
- Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
Guo, Ziyu, Zhang, Renrui, Li, Hongyu, Zhang, Manyuan, Chen, Xinyan, Wang, Sifan, Feng, Yan, Pei, Peng, Heng, Pheng-Ann
Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. They incorporate textual reasoning, i.e., thinking, either before (as pre-planning) or after (as post-refinement) the generation process, yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generated, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies: zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code will be released at: https://github.com/ZiyuGuo99/Thinking-while-Generating.
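To make the interleaved loop concrete, a minimal sketch follows; `think`, `generate_region`, and `reflect` are hypothetical stand-ins for model calls, not the released TwiG interface.

```python
# Hypothetical sketch of think -> generate -> reflect interleaving.
def thinking_while_generating(prompt, num_regions, think, generate_region, reflect):
    canvas, thoughts = None, []
    for i in range(num_regions):
        # 1) Reason about the upcoming local region, conditioned on the
        #    prompt, the partial canvas, and prior thoughts.
        plan = think(prompt, canvas, thoughts)
        # 2) Generate that region guided by the plan.
        canvas = generate_region(canvas, plan, region_index=i)
        # 3) Reflect on what was just synthesized; the critique feeds the
        #    next round of reasoning.
        thoughts.append(reflect(prompt, canvas, plan))
    return canvas, thoughts
```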
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
SafeEditor: Unified MLLM for Efficient Post-hoc T2I Safety Editing
Zhang, Ruiyang, Luo, Jiahao, Feng, Xiaoru, Pang, Qiufan, Yang, Yaodong, Dai, Juntao
With the rapid advancement of text-to-image (T2I) models, ensuring their safety has become increasingly critical. Existing safety approaches can be categorized into training-time and inference-time methods. While inference-time methods are widely adopted due to their cost-effectiveness, they often suffer from limitations such as over-refusal and imbalance between safety and utility. To address these challenges, we propose a multi-round safety editing framework that functions as a model-agnostic, plug-and-play module, enabling efficient safety alignment for any text-to-image model. Central to this framework is MR-SafeEdit, a multi-round image-text interleaved dataset specifically constructed for safety editing in text-to-image generation. We introduce a post-hoc safety editing paradigm that mirrors the human cognitive process of identifying and refining unsafe content. To instantiate this paradigm, we develop SafeEditor, a unified MLLM capable of multi-round safety editing on generated images. Experimental results show that SafeEditor surpasses prior safety approaches by reducing over-refusal while achieving a more favorable safety-utility balance.
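A hedged sketch of the multi-round editing loop described above; `judge_safety` and `edit_image` are hypothetical stand-ins for the SafeEditor MLLM, and the round budget is an assumption.

```python
# Sketch of the post-hoc safety-editing paradigm: identify unsafe content
# and refine it in place, rather than refusing the whole generation.
def post_hoc_safety_edit(image, judge_safety, edit_image, max_rounds=3):
    for _ in range(max_rounds):
        verdict = judge_safety(image)          # e.g. {"safe": bool, "issue": str}
        if verdict["safe"]:
            return image                       # utility preserved: no over-refusal
        image = edit_image(image, verdict["issue"])  # targeted local edit
    return image  # best effort once the editing budget is exhausted
```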
- North America > Mexico > Mexico City > Mexico City (0.04)
- Asia > China > Beijing > Beijing (0.04)
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
Wang, Haozhe, Su, Alex, Ren, Weiming, Lin, Fangzhen, Chen, Wenhu
Chain-of-thought reasoning has significantly improved the performance of Large Language Models (LLMs) across various domains. However, this reasoning process has been confined exclusively to textual space, limiting its effectiveness in visually intensive tasks. To address this limitation, we introduce the concept of reasoning in the pixel-space. Within this novel framework, Vision-Language Models (VLMs) are equipped with a suite of visual reasoning operations, such as zoom-in and select-frame. These operations enable VLMs to directly inspect, interrogate, and infer from visual evidence, thereby enhancing reasoning fidelity for visual tasks. Cultivating such pixel-space reasoning capabilities in VLMs presents notable challenges, including the model's initially imbalanced competence and its reluctance to adopt the newly introduced pixel-space operations. We address these challenges through a two-phase training approach. The first phase employs instruction tuning on synthesized reasoning traces to familiarize the model with the novel visual operations. Following this, a reinforcement learning (RL) phase leverages a curiosity-driven reward scheme to balance exploration between pixel-space reasoning and textual reasoning. With these visual operations, VLMs can interact with complex visual inputs, such as information-rich images or videos, to proactively gather necessary information. We demonstrate that this approach significantly improves VLM performance across diverse visual reasoning benchmarks. Our 7B model, Pixel Reasoner, achieves 84% on V* bench, 74% on TallyQA-Complex, and 84% on InfographicsVQA, marking the highest accuracy achieved by any open-source model to date. These results highlight the importance of pixel-space reasoning and the effectiveness of our framework.
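One plausible form of the curiosity-driven reward, sketched under stated assumptions: correctness reward plus a bonus that nudges the policy toward the pixel-space operations while their usage rate is low. The constants and the exact functional form are illustrative, not the paper's.

```python
# Illustrative curiosity bonus: encourage the newly introduced pixel-space
# operations whenever their rollout-level usage rate falls below a target.
def curiosity_reward(is_correct: bool, used_pixel_ops: bool,
                     pixel_op_rate: float, target_rate: float = 0.3,
                     bonus: float = 0.5) -> float:
    r = 1.0 if is_correct else 0.0                 # base correctness reward
    if used_pixel_ops and pixel_op_rate < target_rate:
        # Bonus shrinks as usage approaches the target rate (assumed form).
        r += bonus * (target_rate - pixel_op_rate) / target_rate
    return r
```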
- North America > United States > Alabama > Jefferson County > Birmingham (0.04)
- Asia > China > Hong Kong (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.87)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
R1-Code-Interpreter: LLMs Reason with Code via Supervised and Multi-stage Reinforcement Learning
Chen, Yongchao, Liu, Yueying, Zhou, Junwei, Hao, Yilun, Wang, Jingquan, Zhang, Yang, Li, Na, Fan, Chuchu
Practical guidance on training Large Language Models (LLMs) to leverage a Code Interpreter across diverse tasks remains lacking. While reinforcement learning (RL)-based fine-tuning has significantly improved LLMs' textual reasoning and planning, such reasoning handles precise computation and symbolic manipulation poorly; in contrast, symbolic code generation handles these rigorously and benefits from external tools. A key challenge is guiding LLMs to decide when to rely on textual reasoning versus programmatic solutions, given that most input questions lack explicit cues about which approach is best and the possible text/code solution space is large. OpenAI's GPT models address this by incorporating a Code Interpreter, allowing iterative code generation and execution; however, existing Code Interpreter implementations struggle to effectively steer between text and code, underutilizing symbolic capabilities. Recent work such as ToRL (Li et al., 2025b) and ReTool (Feng et al., 2025) investigates training reasoning models to integrate with Code Interpreters. To tackle these challenges, we present R1-Code-Interpreter, a framework for integrating a Code Interpreter into open-source LLMs. We curate 144 reasoning and planning tasks and synthesize 6.5k multi-turn text/code trajectories for training; this remains difficult owing to task heterogeneity and the scarcity of effective samples. Our final model, R1-CI-14B, improves average accuracy on the 37 test tasks from 44.1% to 72.4%. Notably, R1-CI-14B also exhibits emergent self-checking behavior through code generation.
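The text/code interleaving such a Code Interpreter setup implies can be sketched as follows; `llm_generate` is a hypothetical stand-in, the `<code>` tag convention is an assumption, and the exec-based sandbox is deliberately naive.

```python
# Minimal sketch of a multi-turn text/code loop: the model emits either
# prose or an executable snippet, code output is appended to the context,
# and generation continues until a final answer appears.
import contextlib, io, re

CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)  # assumed convention

def run_code(snippet: str) -> str:
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(snippet, {})            # a real system needs proper sandboxing
    return buf.getvalue()

def solve(question: str, llm_generate, max_turns: int = 5) -> str:
    context = question
    for _ in range(max_turns):
        step = llm_generate(context)            # text, code, or final answer
        match = CODE_RE.search(step)
        if match:                               # programmatic turn: execute
            step += "\nObservation:\n" + run_code(match.group(1))
        elif "Final Answer:" in step:           # terminating textual turn
            return step.split("Final Answer:")[-1].strip()
        context += "\n" + step                  # purely textual turn
    return context
```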
- Asia > Vietnam > Hanoi > Hanoi (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- Instructional Material (0.66)
- Research Report (0.50)
dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought
Wen, Junjie, Zhu, Minjie, Liu, Jiaming, Liu, Zhiyuan, Yang, Yicun, Zhang, Linfeng, Zhang, Shanghang, Zhu, Yichen, Xu, Yi
Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics. We introduce dVLA, a diffusion-based VLA that leverages a multimodal chain-of-thought to unify visual perception, language reasoning, and robotic control in a single system. For practical deployment, we mitigate inference latency by incorporating two acceleration strategies--a prefix attention mask and key-value (KV) caching--yielding up to a 2× speedup at test-time inference. We evaluate dVLA in both simulation and the real world: on the LIBERO benchmark it achieves state-of-the-art performance with a 96.4% average success rate, consistently surpassing both discrete and continuous action policies; on a real Franka robot, it succeeds across a diverse task suite, including a challenging bin-picking task that requires multi-step planning, demonstrating robust real-world performance. Together, these results underscore the promise of unified diffusion frameworks for practical, high-performance VLA robotics. The development of VLA models has undergone two stages of evolution. In the first stage, a pre-trained vision-language backbone is used purely as a feature extractor, and the extracted features are mapped directly to robot actions. As vanilla VLA architectures proved inadequate for open-world instruction following and long-horizon tasks, a second-stage training paradigm co-trains on image-text data alongside action trajectories to preserve knowledge from the pre-trained VLM and, when necessary, to predict both sub-step reasoning and robot actions (Zhou et al., 2025b;a; Intelligence et al., 2025b; Driess et al., 2025).
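A toy illustration of why the two acceleration tricks combine: under a prefix attention mask the vision-language prefix attends only within itself, so its keys and values are identical at every denoising step and can be cached once. The single-head attention and shapes below are simplifications, not the dVLA implementation.

```python
# Toy sketch: prefix K/V computed once and reused across denoising steps.
import numpy as np

def attention(q, k, v):
    w = np.exp(q @ k.T / np.sqrt(q.shape[-1]))
    return (w / w.sum(-1, keepdims=True)) @ v

rng = np.random.default_rng(0)
prefix = rng.normal(size=(16, 64))      # static prompt/image tokens
kv_cache = (prefix, prefix)             # computed once, reused every step

for step in range(10):                  # diffusion denoising steps
    action_tokens = rng.normal(size=(8, 64))    # only these change per step
    k = np.concatenate([kv_cache[0], action_tokens])
    v = np.concatenate([kv_cache[1], action_tokens])
    out = attention(action_tokens, k, v)        # prefix K/V never recomputed
```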
- North America > Montserrat (0.04)
- Europe > Netherlands > South Holland > Delft (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
KnowDR-REC: A Benchmark for Referring Expression Comprehension with Real-World Knowledge
Jin, Guanghao, Wu, Jingpei, Guo, Tianpei, Niu, Yiyi, Zhou, Weidong, Liu, Guoyang
Referring Expression Comprehension (REC) is a popular multimodal task that aims to accurately detect target objects within a single image based on a given textual expression. However, due to the limitations of earlier models, traditional REC benchmarks either rely solely on intra-image cues or lack sufficiently fine-grained instance annotations, making them inadequate for evaluating the reasoning capabilities of Multi-modal Large Language Models (MLLMs). To address this gap, we propose a new benchmark, KnowDR-REC, characterized by three key features: Firstly, it is built upon real-world knowledge, requiring fine-grained multimodal reasoning across text and image. Secondly, the dataset includes elaborately constructed negative samples via fine-grained expression editing, designed to evaluate a model's robustness and anti-hallucination ability. Lastly, we introduce three novel evaluation metrics to systematically explore the model's internal reasoning process. We evaluate 16 state-of-the-art multimodal models on KnowDR-REC, with experimental results showing that existing MLLMs still struggle with knowledge-driven visual grounding tasks. Furthermore, we observe a decoupling between textual understanding and visual grounding in MLLMs, where many models are significantly influenced by memorized shortcut correlations, which severely affect their behavior on our benchmark and hinder genuine multimodal reasoning. We anticipate that the proposed benchmark will inspire future research towards developing more robust, interpretable, and knowledge-intensive visual grounding frameworks, driving the development of more reliable and robust multimodal systems for complex real-world scenarios.
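A hypothetical scoring sketch for the benchmark's negative samples: on an edited, unanswerable expression the desired behavior is an explicit abstention rather than a grounded box. Field names and the abstention convention are assumptions for illustration, not the paper's metrics.

```python
# Illustrative evaluation over positive (grounded) and negative (edited,
# unanswerable) referring expressions.
def rec_scores(records):
    """records: dicts with 'is_negative', 'pred_box', 'iou' (assumed schema)."""
    pos = [r for r in records if not r["is_negative"]]
    neg = [r for r in records if r["is_negative"]]
    grounding_acc = sum(r["iou"] >= 0.5 for r in pos) / max(len(pos), 1)
    # Anti-hallucination: fraction of negatives where the model abstained
    # (predicted no box) instead of grounding a non-existent referent.
    anti_halluc = sum(r["pred_box"] is None for r in neg) / max(len(neg), 1)
    return {"grounding_acc": grounding_acc, "anti_hallucination": anti_halluc}
```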
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)
RIG: Synergizing Reasoning and Imagination in End-to-End Generalist Policy
Zhao, Zhonghan, Zhang, Wenwei, Huang, Haian, Liu, Kuikun, Gao, Jianfei, Wang, Gaoang, Chen, Kai
Reasoning before action and imagining potential outcomes (i.e., world models) are essential for embodied agents operating in complex open-world environments. Yet, prior work either incorporates only one of these abilities in an end-to-end agent or integrates multiple specialized models into an agent system, limiting the learning efficiency and generalization of the policy. Thus, this paper makes the first attempt to synergize Reasoning and Imagination in an end-to-end Generalist policy, termed RIG. To train RIG in an end-to-end manner, we construct a data pipeline that progressively integrates and enriches the content of imagination and reasoning in the trajectories collected from existing agents. The joint learning of reasoning and next-image generation explicitly models the inherent correlation between reasoning, action, and the dynamics of environments, and thus yields a more than $17\times$ improvement in sample efficiency, along with better generalization, in comparison with previous works. During inference, RIG first reasons about the next action, produces a potential action, and then predicts the action outcomes, which offers the agent a chance to review and self-correct based on the imagination before taking real actions. Experimental results show that the synergy of reasoning and imagination not only improves the robustness, generalization, and interoperability of the generalist policy but also enables test-time scaling to enhance overall performance.
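The review-and-self-correct loop can be sketched as follows; every callable is a hypothetical stand-in for one of RIG's heads, not its actual interface.

```python
# Sketch of reason -> propose -> imagine -> revise before acting.
def rig_step(observation, reason, propose, imagine, acceptable, max_revisions=2):
    thought = reason(observation)
    action = propose(observation, thought)
    for _ in range(max_revisions):
        predicted_outcome = imagine(observation, action)  # next-image prediction
        if acceptable(predicted_outcome, thought):
            break                           # imagined outcome matches the plan
        # Revise the action using the imagined outcome (assumes a string critique).
        action = propose(observation, thought + " revise: " + predicted_outcome)
    return action                  # only now execute in the real environment
```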
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.88)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance
Chen, Yongchao, Hao, Yilun, Liu, Yueying, Zhang, Yang, Fan, Chuchu
Existing methods fail to effectively steer Large Language Models (LLMs) between textual reasoning and code generation, leaving symbolic computing capabilities underutilized. We introduce CodeSteer, an effective method for guiding LLM code/text generation. We construct a comprehensive benchmark, SymBench, comprising 37 symbolic tasks with adjustable complexity, and also synthesize datasets of 12k multi-round guidance/generation trajectories and 5.5k guidance comparison pairs. We fine-tune the Llama-3-8B model with a newly designed multi-round supervised fine-tuning (SFT) and direct preference optimization (DPO). The resulting model, CodeSteerLLM, augmented with the proposed symbolic and self-answer checkers, effectively guides the code/text generation of larger models. Augmenting GPT-4o with CodeSteer raises its average performance score from 53.3 to 86.4, even outperforming the strongest existing LLMs: OpenAI o1 (82.7), o1-preview (74.8), and DeepSeek R1 (76.8), across all 37 tasks (28 seen, 9 unseen). Trained for GPT-4o, CodeSteer demonstrates superior generalizability, providing an average 41.8-point performance boost on Claude, Mistral, and GPT-3.5. CodeSteer-guided LLMs fully harness symbolic computing to maintain strong performance on highly complex tasks. Models, datasets, and code are available at https://github.com/yongchao98/CodeSteer-v1.0.
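A minimal sketch of the steering loop under stated assumptions (all four callables are hypothetical interfaces): a small guidance model repeatedly chooses code or text for the larger generator, and the symbolic and self-answer checkers decide when to stop.

```python
# Illustrative multi-round code/text guidance loop with two checkers.
def codesteer(question, guide, generate, symbolic_check, self_check, max_rounds=4):
    history = []
    for _ in range(max_rounds):
        mode = guide(question, history)            # "code" or "text"
        answer = generate(question, mode, history)
        if symbolic_check(answer) and self_check(question, answer):
            return answer                          # both checkers accept
        history.append((mode, answer))             # feed back for re-guidance
    return answer                                  # best attempt after budget
```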
- North America > United States > Illinois > Champaign County > Urbana (0.14)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)